Foot-based Intonation for Text-to-Speech Synthesis using Neural Networks

نویسندگان

  • Mahsa Sadat Elyasi Langarani
  • Jan van Santen
چکیده

We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template based variant of FONN (“DRIFT”) by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly-covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Study on Unit-Selection and Statistical Parametric Speech Synthesis Techniques

One of the interesting topics on multimedia domain is concerned with empowering computer in order to speech production. Speech synthesis is granting human abilities to the computer for speech production. Data-based approach and process-based approach are the two main approaches on speech synthesis. Each approach has its varied challenges. Unit-selection speech synthesis and statistical parametr...

متن کامل

Data-driven foot-based intonation generator for text-to-speech synthesis

We propose a method for generating F0 contours for text-tospeech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitte...

متن کامل

Synthesizing intonation of standard arabic language

In this paper, we propose a model to generate fundamental frequency (F0) contours using neural networks. A learning procedure is proposed as an alternative to synthesis-by-rules. The generation of correct fundamental frequency contour is one of the important issues in the naturalness of automatic text-to-speech conversion systems. The proposed approach is based on a standard feed-forward multi-...

متن کامل

Application of Neural Networks for POS Tagging and Intonation Control in Speech Synthesis for Polish

The paper describes use of neural networks in POS (part-of-speech) tagging and intonation control, needed in a speech synthesis system for the Polish language. Feedforward multilayered perceptrons have been proposed for both purposes. Considerations during planning the network architecture, used training data, training process and verification of the results are described.

متن کامل

An Overview of Prosodic Modelling for Croatian Speech Synthesis

In order to include prosody into the text to speech (TTS) systems prosody knowledge needs to be acquired, represented and incorporated. Two main features of prosody important for modelling prosody for TTS systems are duration and F0 contour. There are various approaches to modelling those features and they can be categorized into three main groups: rule based, statistical and minimalistic. Some...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016